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Abstract 

The paper addresses the problem of learning a regression model parameterized by a fixed-rank 
positive semidefinite matrix. The focus is on the nonlinear nature of the search space and on 
scalability to high-dimensional problems. The mathematical developments rely on the theory 
of gradient descent algorithms adapted to the Riemannian geometry that underlies the set 
of fixed-rank positive semidefinite matrices. In contrast with previous contributions in the 
literature, no restrictions are imposed on the range space of the learned matrix. The resulting 
algorithms maintain a linear complexity in the problem size and enjoy important invariance 
properties. We apply the proposed algorithms to the problem of learning a distance function 
parameterized by a positive semidefinite matrix. Good performance is observed on classical 
benchmarks. 

1 Introduction 

A fundamental problem of machine learning is the learning of a distance between data samples. 
When the distance can be written as a quadratic form (either in the data space (Mahalanobis 
distance) or in a kernel feature space (kernel distance)), the learning problem is a regression 
problem on the set of positive definite matrices. The regression problem is turned into the 
minimization of the prediction error, leading to an optimization framework and gradient-based 
algorithms. 

The present paper focuses on the nonlinear nature of the search space. The classical frame- 
work of gradient-based learning can be generalized provided that the nonlinear search space 
is equipped with a proper Riemannian geometry. Adopting this general framework, we design 
novel learning algorithms on the space of fixed-rank positive semidefinite matrices, denoted 
by S+(r, d), where d is the dimension of the matrix, and r is its rank. Learning a parametric 
model in S+(r,d) amounts to jointly learn a r-dimensional subspace and a quadratic distance 
in this subspace. 
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The framework is motivated by low-rank learning in large-scale applications. If the data 
space is of dimension d, the goal is to maintain a linear computational complexity 0(d). In 
contrast to the classical approach of first reducing the dimension of the data and then learning 
a distance in the reduced space, there is an obvious conceptual advantage to perform the two 
tasks simultaneously. If this objective can be achieved without increasing the numerical cost 
of the algorithm, the advantage becomes also practical. 

Our approach makes use of two quo tient geometries of the set S+(r±d) that have been 



recently studied by iJournee et al.l 20101 ] and lBonnabel and Sepulchre! 120091. Ma k ing u se of 



a general theory of line-search algorithms in quotient matrix spaces [Absil et all 120081 ] . we 



obtain concrete gradient updates that maintain the rank and the positivity of the learned 
model at each iteration. This is because the update is intrinsically constrained to belong to 
the nonlinear search space, in contrast to early learning algorithms that neglect th e non linear 



nature of the search space in the u pdate and impose the constraints a posteriori Xing et al 
2001 iGloberson and Roweisl . 1200^ . 



Not surprisingly, our approach has close connections with a number of recent contributions 
on learning a lgorithms. Learning problems ove r nonlinear matrix spaces i nclud e the learning 
of subspaces ICrammerl. 12 006 
definite matrices 



Tsuda et al. 



Warmuth . 2007 ]. rotation matrices Arora . 20091 ] . and positive 



J. The space of (full-rank) positive definite matrices S+(d) 
is of particular interest since it coincides with our set of interest in the particular case r = d. 
The use of Bregm an divergences and alternating projection has been recently investigated 



for learning in S+(d). iTsuda et al.l 20051 ] propose to use the vo n Neumann divergence, resu lt- 
ing in a generalization of the well-known AdaBoost algorithm Schapire and Singer , Il999l ] to 
positive definite matrices. The use of the so-called LogDet divergence has also been investi- 



gated by iDavis et al.l [20071 ] in the context of Mahalanobis distance learning. 



More recently, algorithmic work has focused on scalability in terms of dimensionality and 
data set size. A natural extension of the previous work on positive definite matrices is thus 
to consider low-rank positive semidefinite matrices. Indeed, whereas algorithms based on full- 
rank matrices scale as 0(d 3 ) and require 0(d 2 ) storage units, algor i thms based on low-rank 
matr ices scale as 0(dr 2 ) and require 0(dr) storage units Fine et all l200ll . iBach and Jordan! . 



20051 ] . This is a significant complexity reduction as the approximation rank r is typically very 



small compared to the di mension of the problem d. 

Extending the work of lTsuda et"al] |2005| . iKulis et al.l f2009] recently considered the learn- 
ing of positive semidefinite matrices. The authors consider Bregman divergence measures that 
enjoy convexity properties and lead to updates that preserve the rank as well as the positive 
semidefinite property. However, these divergence-based algorithms intrinsically constrain the 
learning algorithm to a fixed range space. A practical limitation of this approach is that the 
subspace of the learned matrix is fixed beforehand by the initial condition of the algorithm. 

The approach proposed in the present paper is in a sense more classical (we just perform 
a line-search in a Riemannian manifold) but we show how to interpret Bregman divergence 
based algorithms in our framework. This is potentially a contribution of independent interest 
since a general convergence theory exists for line-search algorithms on Riemannian manifolds. 
The generality of the proposed framework is of course motivated by the non-convex nature of 
the rank constraint. 

The paper is organized as follows. Section[2]presents the general optimization framework of 
Riemannian learning. This framework is then applied to the learning of subspaces (Section 0]), 
positive definite matrices (Section [5]) and fixed-rank positive semidefinite matrices (Section [6]). 
The novel proposed algorithms are presented in Section [7] Section [8] discusses the relationship 
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to existing work as well as extensions of the proposed approach. Applications are presented 
in Section [9] and experimental results are presented in Section [TU1 



2 Linear Regression on Riemannian Spaces 

We consider the following standard regression problem. Given 

(i) data points X, in a linear data space X = M. dxd 1 

(ii) observations y, in a linear output space y = M., (or Mr), 

(iii) a regression model y = yw(X) parameterized by a matrix W in a search space W, 

(iv) a quadratic loss function £(y,y) = \{y — y) 2 , 

find the optimal fit W* that minimizes the expected cost 

F(W) = E^ y {£(y,y)} = J £(y,y) dP(K,y), 

where £{y, y) penalizes the discrepancy between observations and predictions, and -P(X, y) is 
the (unknown) joint probability distribution over data and observation pairs. Although our 
main interest will be in the scalar model 

y = Tr(WX), 

the theory applies equally to vector data points x € M. d , y = Tr(Wxx T ) = x T Wx, to a 
regression model parameterized by a vector w S M d , y = w T x, or to a vector output space 
y = Wx. 

As it is generally not possible to compute .F(W) explicitly batch learning algorithms 
minimize instead the empirical cost 
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2n 

which is the average loss comp uted oyer a fi nite number of samples {(Xj,yj)}^L 1 . 

Online learning algorithms [ Bottoul . l2004l | consider possibly infinite sets of samples {(Xt, yt)}t>i, 
received one at a time. At time t, the online learning algorithm minimizes the instantaneous 
cost 

ft{w) = \{yt-ytf. 

In the sequel, we only present online versions of algorithms to shorten the exposition. The 
single necessary change to convert an online algorithm into its batch counterpart is to perform, 
at each iteration, the minimization of the empirical cost f n instead of the minimization of the 
instantaneous cost ft- In the sequel, we denote by / the cost function that is minimized at 
each iteration. 

Our focus will be on nonlinear search spa ces W. We only re quire W to have the structure 
of a Riemannian matrix manifold. Following lAbsil et al.l [2008], an abstract gradient descent 
algorithm can then be derived based on the update formula 

W t +i = R Wt (-8t grad/(W t )). (2) 
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The gradient grad/(Wj) is an element of the tangent space Tw t W. The scalar st > is the 
step size. The retraction i?w t is a mapping from the tangent space Tw t W to the Riemannian 
manifold. Under mild conditions on the retraction R, the classical converge nce theory of line; 



search algorithms in linear spaces generalizes to Riemannian manifolds [see I Absil et all 120081 . 
Chapter 4]. 

Observe that the standard (online) learning algorithm for linear regression in R , 

Wt+i = w t - s t (wfx t - y t )xt, (3) 

can be interpreted as a particular case of ([2|) for the linear model y = w T x in the linear search 
space W = R d . The Euclidean metric turns R in a (flat) Riemannian manifold. For a scalar 
function / : R d -> 1 of w, the gradient satisfies 

£>/(w)[<$] = <5 T grad/(w), 

where Df(yv)[5] is the directional derivative of / in the direction S, and the natural retraction 

Rw t (-s t grad/(w t )) = w t - s t grad/(w t ), 

induces a line-search along "straight lines" which are geodesies (that is paths of shortest length) 
in linear spaces. With /(w) = ±(w T x-y) 2 , one arrives at (|3J). 

This example illustrates that the main ingredients to obtain a concrete algorithm are con- 
venient formulas for the gradient and for the retraction mapping. This paper provides such 
formulas for three examples of nonlinear matrix search spaces: the Grassmann manifold (Sec- 
tion HJ, the cone of positive definite matrices (Section [5]), and the set of fixed-rank positive 
semidefinite matrices (Section [6]). Each of those sets will be equipped with quotient Rieman- 
nian geometries that provide convenient formulas for the gradient and for the retractions. 
Line-search algor ithms in quotient Riemannian spaces are discussed in detail in the book of 



Absil et al.l [2008fl. For the readers convenience, basic concepts and notations are introduced 



in the next section. 



3 Line- Search Algorithms on Matrix Manifolds 



This section summarizes the exposition of Absil et al. 20081 . chap. 3 and 4] 



Restrictions on the search space are generally encoded into optimization algorithms by 
means of particular constraints or penalties expressed as a function of the search variable. 
However, when the search space is endowed with a particular manifold structure, it is possible 
to design an exploration strategy that is consistent with the geometry of the problem and that 
appropriately turns the problem into an unconstrained optimization problem. This approach 
is the purpose of optimization algorithms defined on matrix manifolds. 

Informally, a manifold W is a space endowed with a differentiable structure. One usually 
makes the distinction between embedded submanifolds (subsets of larger manifolds) and quo- 
tient manifolds (manifolds described by a set of equivalence classes). An intuitive example of 
embedded submanifold is the sphere embedded in R d . A typical example of quotient manifold 
is the set of r-dimensional subspaces in M. d , viewed as a collection of r-dimensional orthogonal 
frames that cannot be superposed by a rotation. The rotational variants of a given frame thus 
define an equivalence class (denoted using square brackets [•]), which is identified as a single 
point on the quotient manifold. 
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Figure 1: Gradient iteration on a Riemannian manifold. The search direction — grad/(W^) 
belongs to the tangent space Tw t W. The updated point W^j automatically remains inside 
the manifold thanks to the retraction mapping. 



To develop line-search algorithms, the notion of gradient of a scalar cost function needs 
to be extended to manifolds. For that purpose, the manifold W is endowed with a metric 
<?w(£W) Cw)> which is an inner product defined between elements £w,Cw 01 the tangent 
space TwW at W. The metric induces a norm on the tangent space TwW at W: 



HCwIlw = \/5w(£w,£w)- 

The gradient of a smooth scalar function / : W — > M. at W G W is the only element 
grad/(W) e T W W that satisfies 

Df (W)[A) = 5w (A,grad/(W)), VA € T W W, 

where A is a matrix representation of a "geometric" tangent vectors £, and where 

t^o t 

is the standard directional derivative of / at W in the direction A. 

For quotient manifolds W = W/ ~, where W is the total space and ~ is the equivalence 
relation that defines the quotient, the tangent space TpwjW at [W] is sufficiently described 
by the directions that do not induce any displacement in the set of equivalence classes [W] . 
This is achieved by restricting the tangent space at [W] to horizontal vectors £w E TwW 
at W that are orthogonal to the equivalence class [W]. Provided that the metric in the 
total space is invariant along the equivalence classes, it defines a metric in the quotient space 

9[W] (£[W] ! C[W] ) — 9w (fw > Cw) ■ 



The horizontal gradient grad/(W) is obtained by projecting the gradient grad/(W) in the 
total space onto the set of horizontal vectors £w at W. 

Natural displacements at W in a direction £w on the manifold are performed by following 
geodesies (paths of shortest length on the manifold) starting from W and tangent to £w This 
is performed by means of the exponential mapping 

W m = Exp Wt Ot£ w J, 
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which induces a line-search algorithm along geodesies. 

A more general update formula is obtained if we relax the constraint of moving along 
geodesies. The retraction mapping 

Wt+i = -RwtO^wJ, 

locally approximates the exponential mapping. It provides an attractive alternative to the 
exponential mapping in the design of optimization algorithms on manifolds, as it reduces the 
computational complexity of the update while retaining the essential properties that ensure 
convergence results. When £w t coincide with — grad/(W() a gradient descent algorithm on 
the manifold is obtained. Figure [T] pictures a gradient descent update on W. 



4 Linear Regression on the Grassmann Manifold 



Crammer . 


2006, 


Warmuth . 


2007 



and consider 



the linear model 

y = UU T x, 

with U G St(r, d) = {U G R dxr s.t. U T U = I}, the Stiefel manifold of r-dimensional or- 
thonormal bases in M. d . The quadratic loss is then 



/(U) = *(y,x) = ~||y - x||l = ^||UU T x - x|||. 



(4) 



Because the cost (j4]) is invariant by orthogonal transformation U i— > UO, O G 0(r), where 
0{r) = St(r, r) is the orthogonal group, the search space is in fact a set of equivalence classes 

[U] = {UO s.t. O G 0(r)}. 

This set is denoted by St(r, d)/0{r). It is a quotient representation of the set of r-dimensional 
subspaces in IR^, that is, the Grassmann mani f old Gr(r, d). The q uotient geometries of Gr(r, d) 
have been well studied Edelman et all Il998l . lAbsil et all |2004| . The metric 

9[V] (£[U] j C[U] ) — 5u(Cu,Cu), 

is induced by the standard metric in R dxr , 

ffu(Ai,A 2 ) = Tr(AfA 2 ), 

which is invariant along the fibers, that is, equivalence classes. Tangent vectors Gtn a ^ [U] 
are represented by horizontal tangent vectors £u at U: 



= hu A = (I - UU T )A, A G 



rjdxr 



Therefore, the gradient admits the simple horizontal representation 



grad/(U) = IIu grad/(U), 
where grad/(U) is defined by the identity 

£>/(U)[A]=5u(A,grad/(U)). 



(5) 
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A standard retraction in Gr(r, d) is the exponential mapping, that induces a line-search 
along geodesies. The exponential map has the closed-form expression 

Expu(^u) = UVcos(S)V T + Zsin(5])V :r , (6) 



which is o btained from a sin gular value decomposition of the horizontal vector £u = ZSV 



T 



Following lAbsil et al.l 20041 ] . an alternative convenient retraction in Gr(r, d) is given by 



Ru(stu) = [U + sfu] = qf (U + s£u), (7) 

where qf(-) is a function that extracts the orthogonal factor of the QR-decomposition of its 
argument. A possible advantage of the retraction ([7]) over the retraction ([6]) is that, in contrast 
to the SVD computation, the QR decomposition is computed in a fixed number 0(dr 2 ) of 
arithmetic operations. 

With the formulas ([5|) and (|7|) applied to the cost function the abstract update ([2]) 
becomes 



U m = qf(U t + s t (I - V t Vf )xtxf U 



t), 



which is Oja's update for subspace tracking [Ojaj, Il992| 



5 Linear Regression on the Cone of Positive Definite Matrices 

The learning of a full-rank positive definite matrix is recast as follows. Let X = M. dxd and 
y = K, and consider the model 

y = Tr(WX), 

with W G S+(d) = {W G R dxd s.t. W = W T y 0}. Since W is symmetric, only the sym- 
metric part of X will contribute to the trace. The previous model is thus equivalent to 

y = Tr(WSym(X)), 

where Sym(-) extract the symmetric part of its argument, that is, Sym(B) = (B T + B)/2. 
The quadratic loss is 

/(W) = £(y,y) = i(Tr(WSym(X)) - yf . 

The quotient geometries of S+(d) are rooted in the matrix factorization 

W = GG T , G G GL(d), 

where GL(d) is the set of all invertible d x d matrices. Because the factorization is invariant 
by rotation, G \-t GO, O G 0(d), the search space is once again identified to the quotient 

S+(d) ~ GL(d)/0(d), 

which represents the set of equivalence classes 

[G] = {GO s.t. O G 0(d)}. 

We will equip this quotient with two meaningful Riemannian metrics. 
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5.1 A Flat Metric on S+(d) 

The metric on the quotient GL(d)/0(d): 

9[G] (£[G] > C[G] ) = 9g (Cg j Cg ) j 

is induced by the standard metric in M dxa! , 

5 G (A 1 ,A 2 ) = Tr(AfA 2 ), (8) 

which is invariant by rotation along the set of equivalence classes. As a consequence, it induces 
a metric g[ G ] on ^+( £ 0- With this geometry, a tangent vector £[g] a ^ [G] is represented by a 
horizontal tangent vector £ G at G by 

£ G = Sym(A)G, A G M dxd . 

The horizontal gradient of 

/(G) = £(y,y) = ^(Tr(GG T Sym(X)) - y)\ (9) 



is the unique horizontal vector grad/(G) that satisfies 



£>/(G)[A]=3 G (A,grad/(G)). 
Elementary computations yield 



grad/(G) = 2(y-y)Sym(X)G. 
Since the metric is flat, geodesies are straight lines and the exponential mapping is 

Exp G (£ G ) = [G + £ G ] = G + £ G . 
Those formulas applied to the cost ([9]) turns the abstract update ([2]) into the simple formula 

Gt+i = G t - 2s t (y t - y t )Sym(X t )G t , (10) 
for an online gradient algorithm and 

G m = G t - 2s t - VYfe - y l )Sym(X l )Gf, (11) 

T7. ^— ' 



i=i 



for a batch gradient algorithm. 

5.2 The Affine-Invariant Metric on S + (d) 

Because S+(d) — GL{d) /0{d) is t he quotient of two Lie gro ups, its (reductive) geometric 



structure can be further exploited [Faraut and Koranvil . Il994j |. Indeed the group GL(cT) has 



a natural action on S+(d) via the transformation W h-> AWA T for any A G GL(rf). The 
affine-invariant metric admits interesting invariance properties to these transformations. To 
build such an affine-invariant metric, the metric at identity 

<?l(£l,Cl)=Tr(aCi), 
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is extended to the entire space to satisfy the invariance property 

5i(£i,Ci) =5w(W^iWi,wiCiW5)= 5w (e w ,Cw)- 
The resulting metric on S+(d) is defined by 

<?w(£w, Cw) = h^wW^CwW- 1 ). (12) 

The affine-invariant ge ometry of S+ (d) has been well studied, in particular in the context of 
information geometry Smithl . l2005l |. Indeed, any positive definite matrix W 6 S+(d) can be 
identified to the multivariate normal distribution of zero mean Af(0, W). whose probability 
density is p(z;W) = exp(— ^z T W _1 z), where Z is a normalizing constant. Using such 
a metric allows to endow the space of parameters S+(d) with a distance that reflects the 
proximity of the probability distributions. The Riemannian metric thus distorts the Euclidean 
distances between positive definite matrices in order to reflect the amount of information 
between the two associated probability distributions. If £w is a tangent vector to W £ S+(d), 
we have the following approximation for the Kullback-Leibler divergence (up to third order 
terms) 

D KL (p(z; W)\\p(z; W + £ w )) « ± <$f M (£w,£w) = ^ Sw(£w,£w), 

where g^ M is the well -known Fishe r information metric at W, which coincides with the affine- 
invariant metric (fT2|) [S mithl . |2005| |. With this geometry, tangent vectors £w are expressed 
as 

£ w = W5Sym(A)W5, A G R dxd . 
The gradient grad/(W) is given by 

£>/(W)[A] =5w(A,grad/(W)). 
Applying this formula to ([5]) yields 

grad/(W) = (y - y)WSym(X)W. (13) 
The exponential mapping has the closed-form expression 

Exp w (£ w ) = W5 exp(W-5£ w W^)wi (14) 
Its first-order approximation provides the convenient retraction 

iM*fw) = W - a£w ( 15 ) 
The formulas (|13p and (fl~4l) applied to the cost ([5]) turn the abstract update ([2]) into 

W t+ i = Wf exp(-a t (ft - y f )Wf Sym(X t )w|)w|. 
With the alternative retraction (|15p . the update becomes 

Wt+i = W t - s t (fc - y t )W t Sym(X t )W t , 
which is the update of Davis et al. 2007| based on the LogDet divergence (see Section fe.ip . 
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5.3 The Log-Euclidean Metric on S + (d) 

For the sake of completeness, we briefly review a third Riemannian geometry of S+(d), that 
exploits the property 

W = exp(S), S = S T E R dxd . 

The matrix exponential thus provides a global diffeomorphism between S+(d) and the lin- 
ear space of d x d symmetric matrices. This geometry is studied in detail in the paper 



Arsignv et all |2007| | . The cost function 

/(S) = £(y,y) = i(Tr(exp(S)Sym(X)) - y) 2 , 

thus defines a cost function in the linear space of symmetric matrices. The gradient of this 
cost function is given by 

grad/(S) = (y t - j/ f )Sym(X t ), 

and the retraction is 

#s(s£s) = exp(log W + s£ s ). 
The corresponding gradient descent update is 

Wt+i = exp(log W( - s t (y t - j/t)Sym(X t )), 



which is the update of iTsuda et al.l 20051 ] based on the von Neumann divergence. 



6 Linear Regression on Fixed-Rank Positive Semidefinite Ma- 
trices 

We now present the proposed generalizations to fixed-rank positive semidefinite matrices. 
6.1 Linear Regression with a Flat Geometry 

The generalization of the results of Section 15.11 to the set S+ (r, d) is a straightforward conse- 
quence of the factorization 

W = GG T , G G Rt xr , 

where R dxr = {G G M dxr s.t. det(G T G) / 0}. Indeed, the flat quotient geometry of the man- 
ifold S+(d) ~ GL{d) /0{d) is generalized to the quotient geometry of S + (r,d) ~ M. dxr /0{r) 
by a mere adaptation of matrix dimension, leading to the updates (|10|) and (|lip for matri- 
ces Gf G ]R^ xr . The mathematical d erivation of these up dates is a straight application of 
the material presented in the paper of Jour nee et al. 2010l ] . where the quotient geometry of 



S+(r,d) ~ W dxr jO{r) is studied in details. In the next section, we propose an alternative 
geometry that jointly learns a r-dimensional subspace and a full-rank quadratic model in this 
subspace. 
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6.2 Linear Regression with a Polar Geometry 

In contrast to the flat geometry, the affine-invariant geometry of S+(d) — GL{d)/0{d) does 
not generalize directly to S+(r,d) ~ R^ xr fO(r) because R^ xr is not a group. However, a 
generalization is possible by considering the polar matrix factorization 

G = UR, U G St(r, d), R G S+(r). 

It is obtaine d from the singular value d ecomposition of G = ZSV T as U = ZV T and 
R = VSV T [Golub and Van Loanl . Il996| . This gives a polar parametrization of S + (r, d) 

W = UR 2 U T . 

This development leads to the quotient representation 

S+(r,d) ~ (St(r,d) x S + (r))/0(r), (16) 

based on the invariance of W to the transformation (U, R 2 ) i-> (UO, T R 2 0), O E O(r). It 
thus describes the set of equivalence classes 

[(U,R 2 )] = {(UO,0 T R 2 0) s.t. O G 0(r)}. 

The cost function is now given by 

/(U, R 2 ) = £(y, y) = i(Tr(UR 2 U T Sym(X)) - yf . (17) 



The Riemannian geometry of (|16p has been recently studied by iBonnabel and Sepulchre 
2009(| . A tangent vector £[~w] = (£u> £r 2 )[u,R 2 ] a ^ [(U, R 2 )] is described by a horizontal 



tangent vector £ w = (£u, £r2)(u,r 2 ) at (U,R 2 ) by 

£u = nuA, A E R dxr , ^R2 = RSym(*)R, * G M rxr . 

The metric 

5[w] (£[w] > C[w] ) — 5w (fw , Cw ) 

= ^ <?u(£u,Cu) + :j— ^ 5r2(Cr2,Cr2), (18) 

where A G (0, 1), is induced by the metric of St(r, d) and the affine-invariant metric of S+(r), 
&j(Ai, A 2 ) = Tr(Af A 2 ), 3 R2 (*i, * 2 ) = Tr(*iR- 2 ¥ 2 R~ 2 ). 

The proposed metric is invariant along the set of equivalence classes and thus induces a 
quotient structure on S+(r,d). Alternative metrics on S+(r) can be considered as long as the 
metric remains invariant along the set of equivalence classes. For instance, the log-Euclidean 
metric discussed in Section T5.3I would qualify as a valid alternative. 
A retraction is provided by distinct retractions on U and R 2 , 

Ru(sZu) = qf(U + ^u) (19) 
^r2(s£r2) = Rexp(sR~ 1 ^ R2 R~ 1 )R. (20) 
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One should observe that this retraction is not the exponential mapping of S+ (r, d) . This 
illustrates the interest of considerin g more general retractions than the exponential mapping. 



Indeed, as discussed in the paper of lBonnabel and Sepulchre] [20091 ] . the geodesies (and there 



fore the exponential mapping) do not appear to have a closed form in the considered geometry. 
Combining the gradient of (|17p with the retractions (fT9|) and (|20p gives 

U t+ i = qf (Ut - 2Xs t (y t - y t )(I - UiU^)Sym(X t )U i Rf) , 
Rf +1 = K t exp (-(1 - X)s t {yt - y t )H t VjSym(X t )V t Tit) R f . 

A factorization Rf+iR^ of R 2 +1 is obtained thanks to the property of matrix exponential, 

exp(A)2 = exp(^A). Updating Rj+i instead of R 2 +1 is thus more efficient from a computa- 
tional point of view, since it avoids the computation of a square root a each iteration. This 
yields the online gradient descent algorithm 

U t+ i = qf {U t - 2Xs t (y t - y t ){I - U t Uf)Sym(X t )U t R2) , 

H , + ^ R4E cp(-I( 1 -A, ai (i 1 - 9l )R J urs ym( X 1 )U l ^ (21) 

and the batch gradient descent algorithm 
Ut+i 



qf (u t ~ 2As t ~ Ytfi - yi )(I - U t Uf )Sym(X i )UiR| j , 
R+i = R t exp - X)s t ^ - yi)R t Ul Sym(X i )U t R t ^ 



(22) 



7 Algorithms 

This section documents implementation details of the proposed algorithms. Generic pseudo- 
codes are provided in Figure [2] and Table [T] summarizes computational complexities. 



Data type 


Input space Batch flat (|TTj) 


Batch polar 


} Online flat (fTU| 


Online polar (12Tj) 


X 

T 

XX 


R dxd 0(d 2 rn) 
R d 0(drn) 


0{d 2 r 2 n) 
0{dr 2 n) 


0(d 2 rp) 
0(drp) 


0{d 2 r 2 p) 
0{dr 2 p) 



Table 1: Computational costs of the proposed algorithms. 



7.1 From Subspace Learning to Distance Learning 

The update expressions (|22p and (|2ip show that A, the tuning parameter of the Riemannian 
metric (JTSJ), acts as a weighting factor on the search direction. A proper tuning of this 
parameter allows us to place more emphasis either on the learning of the subspace U or on 
the distance in that subspace R 2 . In the case A = 1, the algorithm only performs subspace 
learning. Conversely, in the case A = 0, the algorithm learns a distance for a fixed range space 
(see Section I8.ip . Intermediate values of A continuously interpolate between the subspace 
learning problem and the distance learning problem at fixed range space. 



12 



Batch regression 



Online regression 



Input: {(Xi, 

Require: Go or (Uo,Ro), A 
1: t = 
2: repeat 

3: 
4: 
5: 
6: 

7: Compute Armijo step sa from (|23l) 
8: Perform update or l]22[l using sa 
9: 
10: 

11: t = t + l 



Input: {(X t ,t/ t )} t >i 

Require: Go or (Uo,Ro), A, p, s, to, T 

1: t = 0, count = p 

2: while t < T do 

3: if count > then 

4: Accumulate gradient 

5: count = count — 1 

6: else 

7: Compute step size s t according to (|24[) 

8: Perform update (|10|) or (|21|) using St 

9: count = p 

10: end if 

11: t = t + l 

12: end while 

13: return Gt 



12: until stopping criterion (|25p is satisfied 
13: return G t 



Figure 2: Pseudo-codes for the proposed batch and online algorithms. 



A proper tuning of A is of interest when a good estimate of the subspace is available (for 
instance a subspace given by a proper dimension reduction technique) or when too few obser- 
vations are available to jointly estimate the subspace and the distance within that subspace. 
In the latter case, one has the choice to favor either subspace or distance learning. 

Experimental results of Section [TU1 recommend the value A = 0.5 as the default setting. 

7.2 Invariance Properties 

A nice property of the proposed algorithms is their invariance with respect to rotations W i— > 
O t WO, VO G 0{d). This invariance comes from the fact that the chosen metrics are invariant 
to rotations. A practical consequence is that a rotation of the input matrix X i— > OXO T (for 
instance a whitening transformation of the vectors x i— > Ox if X = xx r ) will not affect the 
behavior of the algorithms. 

Besides being invariant to rotations, algorithms (i2~Tj) and (122j) are invariant with respect to 
scalings W i-> /xW with fx > 0. Consequently, a scaling of the input data (X, y) i— > (LiX.,[J,y), 
such as a change of units, will not affect the behavior of these algorithms. 

7.3 Mini-Batch Extension of Online Algorithms 

We consider a mini-batch extension of stochastic gradient algorithms. It consists in perform- 
ing each gradient step with respect to p > 1 examples at a time instead of a single one. 
This is a classical speedup and stabilization heuristic for stochastic gradient algorithms. In 
the particular case p = 1, one recovers plain stochastic gradient descent. Given p samples 
(X^i,ytl), (Xtp,yt p ), received at time t, the abstract update ([2]) becomes 



7.4 Strategies for Choosing the Step Size 

We here present strategies for choosing the step size in both the batch and online cases. 
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7.4.1 Batch Algorithms 



For batch algorithms, classical backtracking methods exist [see lNocedal and Wrightl . |2006| . In 



this paper, we use the Armijo step sa defined at each iteration by the condition 

f(Rw t (s A grad/(Wt))) < /(W t ) + c||grad/(W t )||^ t , (23) 

where W[ G S+(r, d) is the current iterate, c G (0, 1), / is the empirical cost ([I]) and i?w is the 
chosen retraction. In this paper, we choose the particular value c = 0.5 and repetitively divide 
by 2 a specified maximum step size s m ax until condition fj23p is satisfied for the considered 
iteration. In order to reduce the dependence on Smax in a particular problem, it is chosen 
inversely proportional to the norm of the gradient at each iteration, 

«0 



||grad/(W t )|| w / 

A typical value of sq = 100 showed satisfactory results for all the considered problems. 
7.4.2 Online Algorithms 

For online algorithms, the choice of the step size is more involved. In this paper, the step size 
schedule St is chosen as 

s ntn , 

^ = x -— °-, (24) 

Hgrad ™0 + £ 

where s > 0, n is the number of considered learning samples, fi gra d is an estimate of the 
average gradient norm ||grad/(Wo)||w > an d to > controls the annealing rate of St- During 
a pre-training phase of our online algorithms, we select a small subset of learning samples and 
try the values 2 k with k = —3, 3 for both s and to- The values of s and to that provide the 
best decay of the cost function are selected to process the complete set of learning samples. 

7.5 Stopping Criterion 

Batch algorithms are stopped when the value or the relative change of the empirical cost / is 
small enough, or when the relative change in the parameter variation is small enough, 

™ w /(W m ) - /(Wt) ||G m -G f || F 
/(Wt+i) < e tol, or /(W f ) et ° h ° r [| G ii ^ e *°l- ( 25 ) 

We found et \ = 10 -5 to be a good trade-off between accuracy and convergence time. 

Online algorithms are run for a fixed number of epochs (number of passes through the set 
of learning samples). Typically, a few epochs are sufficient to attain satisfactory results. 

7.6 Convergence 

Gradient descent algorithms on matrix manifolds share the well-characterized convergence 
properties of their analog in M d . Batch algorithms converge linearly to a local minimum of 
the empirical cost that depends on the initial condition. Online algorithms converge asymp- 
totically to a local minimum of the expected loss. They intrinsically have a much slower 
convergence rate than b atch algorithms, but they ge nerally decrease faster the expected loss 



in the large-scale regime Bottou and Bousquetl . l2007| . The main idea is that, given a training 
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set of samples, an inaccurate solution may indeed have the same or a lower expected cost than 
a well-optimized one. 

When learning a matrix W G S+(d), the problem is convex and the proposed algorithms 
converge toward a global minimum of the cost function, regardless of the initial condition. 
When learning a low-rank matrix W € S+ (r, d) , with r < d, the proposed algorithms converge 
to a local minimum of the cost function. This is not the case for heuristic methods proposed 
in the literature, which firs t reduce the dimensional i ty of the data before fitting a full-rank 
model on the reduced data [Davis and Dhillonl . 120081 . IWeinberger and Saull . [2009] . 

For batch algorithms, the local convergence results follow from the convergence theory of 



line-search algorithms on Riemannian manifolds [see, for example. lAbsil et all |2008( |. 



For online algorithms, one can prove that the algorithm based on the flat geometry enjoys 
almost sure asymptotic convergence to a local minimum of the expected cost. In that c ase, the 
param eter G belongs to an Euclidean space and the convergence results presented by iBottou 



19981 ] apply (see Appendix [A] for the main ideas of the proof). 



In contras t, when the polar parametrization is used, the convergence results presented by 
Bottou 19981 ] do not apply directly because of the quotient nature of the search space. Because 



the extension would require technical arguments beyond the scope of the present paper, we 
refrain from stating a formal convergence result for the online algorithm based on the polar 
geometry, even though the result is quite plausible. 

Due to the nonconvex nature of the considered rank-constrained problems, the conver- 
gence results are only local and little can be presently said about the global convergence of 
the algorithms. A global analysis of the critical points of the cost functions studied in the 
present paper is nevertheless not hopeless and could be facilitated by the considered low-rank 
parametrizations. For instance, global convergence properti es have been estab lished for PCA 
algorithms from an explicit analysis of the critical points Chen et al. . 19981 ] . Also, recent 



result s suggest good glo bal convergence properties for closely related rank minimization prob- 
lems Recht et al.l . l2010l |. Experimental results suggest the same conclusions for the algorithms 
considered in this paper, which means that further research on global convergence results is 
certainly deserved. 



8 Discussion 

This section presents connections with existing works and extensions of the regression model. 
8.1 Closeness-Based Approaches 



A standard derivation of learning algorithms is as follows [Kivinen and Warmuthl . Il997| . The 



(online) update at time t is viewed as an (approximate) solution of 

W m = argmin D(W,W t ) + s t £(y,y t ), (26) 
wew 

where D is a well-chosen measure of closeness between elements of W and St is a trade- 
off parameter that controls the balance between the conservative term D(W, Wj) and the 
innovation (or data fitting) term £(y,yt). One solves (|26p by solving the algebraic equation 

grad D(W,W t ) = -s t grad £(y t+ i,y t ), (27) 
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which is a first-order (necessary) optimality condition. If the search space W is a Riemannian 
manifold and if the closeness measure D(W, Wj) is the Riemannian distance, the solution of 
d23) is 

W m = Exp Wi (-s t grad £(y t+ i,y t )). 

Because yt+i must be evaluated in Wj+i, this update equation is implicit. However, yt+l is 
generally replaced by yt (which is equal to yt+l up to first order terms in Sf), which gives the 
update ([2]) where the exponential mapping is chosen as a retraction. 

Bregman divergences have been popular closeness measures for D(W,Wt) because they 
render the optimization of (|26l) convex. Bregman divergences on the cone of positive definite 
matrices include the von Neumann divergence 

D vN (W,W t ) = Tr(WlogW- WlogWt- W + Wt), 

and the LogDet divergence 

Di d (W,W t ) = Tr(WW^) - logdet(WW^ 1 ) - d. 

We have shown in Section [5] that the resulting updates can be interpreted as line-search 
updates for the log- Euclidean metric and the affine-invariant metric of S+(d) and for specific 
choices of the retraction mapping. 

Likewise, the algorithm (llOj) can be recast in the framework (126p by considering the close- 
ness 

D flat (W,W t ) = \\G-G t \\ 2 F , 

where W = GG T and Wj = GtGj . Algorithm ([2~T]) can be recast in the framework (|26p by 
considering the closeness 

r 

D pol (w,w t ) = x Y, e f + (i-A) II log Rr^RT 1 III- 

i=l 



wher e the #j's are the principal angles between the subspaces spanned by W and Wt Go lub and Van Loan 



1996l |. and the second term is the affine-invariant distance of S+(d) between matrices R 2 and 



R 2 involved in the polar representation of W and W(. 

Obviously, these closeness measures are no longer convex due to the rank constraint. 
However they reduce to the popular divergences in the full-rank case, up to second order 
terms. In particular, when A = 1, the su bspace is fixed and one recovers the setup of learning 
low-rank matrices of a fixed range space |Kulis et al.l . l2009l |. Thus, the algorithms introduced 



in the present pa per can be viewed as generalizations of the ones presented in the paper of 



Kulis et al.l 20091 ] , where the issue of adapting the range space is presented as an open research 
question. Each of the proposed algorithms provides an efficient workaround for this problem 
at the expense of the (potential) introduction of local minima. 

8.2 Handling Inequalities 

Inequalities y < y or y > y can be considered by treating them as equalities when they are 
not satisfied. This is equivalent to the minimization of the continuously differentiable cost 
function 

/(W) = i(y, y) = X - max(0, p(y - y))\ 
where p = +1 ify<yis required and p = —1 ify>yis required. 
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8.3 Kernelizing the Regression Model 



In this paper, we have not considered the kernelized model 

y = TY(W0(xMx) T ), 

whose predictions can be extended to new input data <^>(x) in the feature space T induced by 
the nonlinear mapping (f> : x E X i— > </>(x) E T . This is potentially a useful extension of the 
regression model that could be investigated in the light of re cent theoretical results in this 
area [for example Chatpatanasiri et all 20 id . Jain et al. . 2010 ] . 



8.4 Connection with Multidimensional Scaling Algorithms 

Given a set of m dissimilarity measures T> = {<%} m between n data objects, multidimensional 
scaling algorithms search for a r -dimensional embedding of the data obj e cts in to an Euclidean 



space representation G E M. nxr Cox and Coxl . 2001 . Borg and Groenen , 2005 1. Each row g of 



G is the coordinates of a data object in a Euclidean space of dimension r. 

Multidimensional scaling algorithms based on gradient descent are equivalent to algo- 
and (fTTj) when X = (e^ — ej)(ej — ej) T , where ej is the z-th unit vector (see 
, and when the multidimensional scaling reduction criterion is the SSTRESS 



rithms (flOl 
Section 19,1 



SSTRESS(G) = (II© -8*1 



Vectors gj and gj are the i-th and j-th. rows of matrix G. Gradient descent is a popular 
technique in the context of multidimensional scaling algorithms. A s tochastic gradient descent 
appro ach for minimizing the SSTRESS has also been proposed by Matsuda and Yamaguchil 



20011 ]. A potential area of future work is the application of the proposed online algorithm (|10p 
for adapting a batch solution to slight modifications of the dissimilarities over time. This 
approach has a much smaller computational cost than recomputing the offline solution at 
every time step. It further allows to keep the coordinate representation coherent over time 
since the solution do not brutally jumps from a local minimum to another. 



9 Applications 

The choice of an appropriate distance measure is a central issue for many distance-based classi- 
fication and clustering algorithms such as nearest neighbor classifiers, support vector machines 
or k-means. Because this choice is highly problem-dependent, numerous methods have been 
proposed to learn a distance function directly from data. In this section, we present two 
important distance learning applications that are compatible with the considered regression 
model and review some relevant literature on the subject. 



9.1 Kernel Learning 



In kernel-based methods [Shawe-Tavlor and Cristianinil . 120041 ] , the data samples Xi , . . . , x n are 
first transformed by a nonlinear mapping <j> : x E X i— > 4>(x) E J~, where J 7 is a new feature 
space that is expected to facilitate pattern detection into the data. 

The kernel function is then defined as the dot product between any two samples in J 7 , 

K(Xj,Xj) = <f>(xLi) ■ 0(Xj). 
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In practice, the kernel function is represented by a positive semidefinite matrix K £ W ixn 
whose entries are defined as Ky = </>(xj) • <j)(xj). This inner product information is used solely 
to compute the relevant quantities needed by the algorithms based on the kernel. For instance, 
a distance is implicitly defined by any kernel function as the Euclidean distance between the 
samples in the new feature space 



C^(Xj,X,) 



(*) 



Xj 



^( x i; x i) ~\~ K ( x j) x j) 2k(Xj,Xj), 



which can be evaluated using only the elements of the kernel matrix by the formula 
d<fi(xi,Xj) = Ka + Kjj - 2Kij = Tr (K(e, - 6j)(ej - ej) T ) , 

which fits into the considered regression model. 

Learning a kernel consists in computing the kernel (or Gram) matrix from scratch or 
improving a existing kernel matrix based on side-information (in a semi-supervised setting 
for instance). Data samples and class labels are generally exploited by means of equality or 
inequality constraints involving pairwise distances or inner products. 

Most of the numerous kernel learning algorithms that have been proposed work in the 
so-called transductive setti ng, that is, it is not possible to generalize the l earned kerne l func- 
tion to new data samples Kwok and Tsangj . 120031 . lLanckriet et all I2004J . iTsuda et all 120051 . 
Zhuang et all . 120991 . iKulis et all l2009| . In that setting, the total number of considered sam- 
ples is known in advance and determines the size of the learned matrix. Recently, algo- 
rithms have been proposed to learn a kernel function that can be extended to new points 
Chatpatanasiri et al. . 2010l . Jain et al. . 2010| . In this paper, we only consider the kernel 



learning problem in the transductive setting. 

When low-rank matrices are considered, kernel learning algorithms can be regarded as 
dimensionality reduction methods. Ver y popular unsupervise d algorithms in that context 
are kernel principal component analysis (Scholkopf et al. . 1998l | and multidimensional scaling 



Cox and Coxl . l200ll . iBorg and Groenenl . 120051 . Other kernel lea rning techniques include the 



maxi mum variance un folding algorithm Weinberger et all 120041 ] and i ts semi-supervis ed ver 



sion [Song et all 120071 ] , and the kernel spectral regression framework Cai et all 120071 ] which 
encompasses many reduction criterion (for example, linear discriminant analysis (LDA), local- 
it y preserving projection (LPP), neighborhood preserving embedding (NPE)). See the survey 
of lYangj [20061 ] for a more complete state-of-the-art in this area. 

Since our algorithms are able to compute a low-rank kernel matrix from data, they can be 
used for unsupervised or semi-supervised dimensionality reduction, depending whether or not 
the class labels are exploited through the imposed constraints. 



9.2 Mahalanobis Distance Learning 

Mahalanobis distances generalize the usual Euclidean distance as it allows to transform the 
data with an arbitrary rotation and scaling before computing the distance. Let Xj,x,- £ 
M. d be two data samples, the (squared) Mahalanobis distance between these two samples is 
parametrized by a positive definite matrix A £ M. dxd and writes as 

d A ( x i, x j) = ( x i - x j) T A (xj - Xj). (28) 

In the particular case of A being equal to the identity matrix, the standard Euclidean distance 
is obtained. A frequently used matrix is A = X! -1 , the inverse of the sample covariance matrix. 
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For centered data features, computing this Mahalanobis distance is equivalent to perform a 
whitening of the data before computing the Euclidean distance. 

For low-rank Mahalanobis matrices, computing the distance is equivalent to first perform 
a linear data reduction step before computing the Euclidean distance on the reduced data0 
Learning a low-rank Mahalanobis matrix can thus be seen as learning a linear projector that 
is used for dimension reduction. 

In contrast to kernel functions, Mahalanobis distances easily generalize to new data samples 
since the sole knowledge of A determines the distance function. 

In recent years, Mahalanobis distance learning algorithms have been the subject of many 
contributions that cannot be all enumerated here. We review a few of them, most relevant for 
the present paper. The first prop osed methods have been based on succes s ive pr ojections onto 
a set of large margin constraints King et all |2002| . IShalev-Shwartz et all l2004j |. The method 
proposed by iGloberson and Roweisl [20051 ] seeks a Mahalanobis matrix that maximizes the 
between classes distance while forcing to zero the within classes distance. A simpler objective 
is pursued by the al gorithms that optimize t h e Mahalanobis dist a nce for the specific fc-nearest 



neigh bor classifier [Goldberger et all I2004I . iTorresani and Led . 120061 . IWeinberger and Saull . 
l2009l | . Bregman projection based methods min i mize a particular Bregman d i verge nce under 



distance constraints. Both batch Davis et al 



20071 and online [Jain et all . 120081 ] formula- 
tions have been proposed for learning full-rank matrices. Low-rank matrices have also been 
considered wit h Bregman divergences but only when the r ange space of the matrix is fixed in 
the first place (Davis and Dhillonl . 120081 . iKulis et all 120091 ]. 



10 Experiments 



Data Set 


Samples 


Features 


Classes 


Reference 


GyrB 


52 




3 


Tsuda et al. [20051 


Digits 


300 


1G 


3 


Asuncion and Newman [2007] 


Wine 


178 


13 


13 


Asuncion and Newman [20071 


Ionosphere 


351 


33 


2 


Asuncion and Newman [20071 


Balance Scale 


625 


4 


3 


Asuncion and Newman [2007] 


Iris 


150 


4 


3 


Asuncion and Newman [20071 


Soybean 


532 


35 


17 


Asuncion and Newman [20071 


USPS 


2,007 


256 


10 


LeCun et al. [19891 


Isolet 


7,797 


617 


26 


Asuncion and Newman [2007] 


Prostate 


322 


15,154 


2 


Petricoin et al. [20021 



Table 2: Considered datasets 



In this section, we illustrate the potential of the proposed algorithms on several benchmark 
experiments. First, the proposed algorithms are evaluated on toy data. Then, they are 
compared to state-of-the-art kernel learning and Mahalanobis distance learning algorithms on 
real datasets. Overall, the experiments support that a joint estimation of a subspace and 
low-dimensional distance in that subspace is a major advantage of the proposed algorithms 
over methods that estimate the matrix for a subspace that is fixed beforehand. 



1 In the low-rank case, one should rigorously refer to (|28[> as a pseudo-distance. Indeed, one has dA.(xi, x.,) = 
with Xi 7^ xj whenever (x^ — Xj) lies in the null space of A. 
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Table [2] summarizes the different datasets that have been considered. As a normalization 
step, the data features are centered and rescaled to unit standard deviation. 

The implementation of the proposed algorithms^ as well as the experiments of this pa- 
per are performed with Matlab. The implementations of algorithms MVU jj KSR @ lmnnB 



and I T MlH have been rendered publi cly available by IWeinberger et ajj 



20071 . IWeinberger and Saull [2009| and iDavis et a l. 2007] respectively. 



2004l | . ICai et al. 
Algorithms POLA 



Shalev-Shwartz et all l2004l |. LogDet-KL |Kulis et all l2009l | and LEGO |.Tain et all l2008l | 



have been implemented on our own. 



10.1 Toy Data 

In this section, the proposed algorithms are evaluated on synthetic regression problems. The 
data vectors xj, ...,x n 6 M d and the target matrix W* 6 S+(r,d) are generated with entries 
drawn from a standard Gaussian distribution A^(0, 1). Observations follow 

y i = (xfW*x i )(l + i/ i ), i = l,...,n, (29) 

where Ui is drawn from M(0, 0.01). A multiplicative noise model is preferred over an additive 
one to easily control that observations remain nonnegative after the superposition of noise. 



10.1.1 Learning the Subspace vs. Fixing the Subspace Up Front 

As an illustrative example, we show the difference between two approaches for fitting the 
data to observations when a target model W* € S+(3, 3) is approximated with a parameter 
WG5+(2,3). 

A naive approach to tackle that problem is to first project the data Xj 6 K 3 on a subspace 
of reduced dimension and then to compute a full-rank model based on the projected data. 
Rece nt methods compute that subspace of reduced d i mensi on using principal component anal- 
ysis 



Davis and Dhillon . 20081 . Weinberger and Saull . 2009 ]. that is, a subspace that captures 



a maximal amount of variance in the data. However, in general, there is no reason why the 
subspace spanned by the top principal components should coincide with the subspace that 
is defined by the target model. Therefore, a more appropriate approach consists in learning 
jointly the subspace and a distance in that subspace that best fits the data to observations 
within that subspace. 

To compare the two approaches, we generate a set of learning samples {(Xj, y?)}f£i, with 
Xj E R 3 and yi that follows (|29p . The target model is 



W* = UAU T 

where U is a random 3x3 orthogonal matrix and A is a diagonal matrix with two dominant 
values An,A22 3> A33 > (for this specific example, An = 4, A22 = 3 and A33 = 0.01). 
Observations yi are thus nearly generated by a rank-2 model, such that W* should be well 
approximated with a matrix W £ S+(2, 3) that minimizes the train error. 



2 The source code is available from http://www.montefiore.ulg.ac.be/~meyer 
3 http : //www. cse . wustl . edu/~kilian/Downloads/MVU.html 
http : //www. cs .uiuc . edu/homes/dengcai2/SR/ 
5 http : //www. cse .wustl . edu/~kilian/Downloads/LMNN .html 
6 http : //www. cs .utexas . edu/users/pjain/itml/ 
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• Data 
□Target model subspace 

□ PCA subspace 

□ Learned subspace 





Figure 3: Learning vs fixing the subspace. Top: the learned subspace is very different from 
the subspace computed from a classical heuristic. Bottom left: fit after projection of the 
data onto a subspace fixed up front. Bottom right: fit obtained with a join estimation of 
the subspace and a distance within that subspace. 
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Results are presented in Figure [3l The top plot shows that the learned subspace (which 
identifies with the target subspace) is indeed very different from the subspace spanned by 
the top two principal components. Moreover, the bottom plots clearly demonstrate that the 
fit is much better when the subspace and the distance in that subspace are learned jointly. 
The difference is also significant in terms of the train error. This simple example shows that 
heuristic methods that fix the range space in the first place may converge to a solution that 
is very different from a minimum of the desired cost function. For visualization purpose, the 
two dimensional model is represented by the ellipse 

"(jT x 

£ = {xj £ M 2 : xf R 2 Xj = 1}, where 5q = -, 

and (U,R 2 ) are computed with algorithm (I22p . either in the setting A = that fixes the 
subspace to the PCA subspace (left) or in the setting A = 0.5 that simultaneously learned U 
and B (right). A perfect fit is obtained when all iq are located on £, which is the locus of 
points where yi = y%. 



10.1.2 Influence of A on the Algorithm Based on the Polar Geometry 




Figure 4: Influence of A. 



In theory, the parameter A should not influence the algorithm since it has no effect on 
the first-order optimality conditions except for its two extreme values A = and A = 1. In 
practice however, a sensitivity to this parameter is observed due to the finite tolerance of the 
stopping criterion: the looser the tolerance, the more sensitive to A. 

To investigate the sensitivity to A, we try to recover a target parameter W* £ 5+ (5, 10) 
using pairs (x,,y,) generated according to (129|) . We generate 10 random regression problems 
with 1000 samples partitioned into 500 learning samples and 500 test samples. We compute 
the mean test error and the mean convergence time as a function of A for different values of 
etoi- The results are presented in Figure 3J As e to \ decrease, the test error becomes insensitive 
to A, but an influence is observed on the convergence time of the algorithm. 

In view of these results, we recommend the value 0.5 as the default setting for A. Unless 
specified otherwise, we therefore use this particular value for all experiments in this paper. 
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10.1.3 Online vs. Batch 




10" 



5 10 15 

Time (in seconds) 

(a) Flat geometry 
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Figure 5: Online vs Batch. For a large number of samples, online algorithms reduce the test 
error much more rapidly than batch ones. Using the mini-batch extension generally improve 
significantly the performance. 



This experiment shows that when a large amount of sample is available (80, 000 training 
samples and 20, 000 test samples for learning a parameter W* in ,S+(10, 50)), online algorithms 
minimize the test error more rapidly than batch ones. It further shows that the mini-batch 
extension allows to improve significantly the performance compared to the plain stochastic 
gradient descent setting (p = 1). We observe that the mini-batch size p = 32 generally gives 
good results. Figure [5] report the test error as a function of the learning time, that is, the time 
after each iteration for batch algorithm and the time after each epoch for online algorithms. For 
the algorithm based on the polar geometry, the mini-batch extension is strongly recommended 
to amortize the larger cost of each update. 



10.2 Kernel Learning 

In this section, the proposed algorithms are applied to the problem of learning a kernel matrix 
from pairwise distance constraints between data samples. As mentioned earlier, we only 
consider this problem in the transductive setting, that is, all samples xi, ...x n are available up 
front and the learned kernel do not generalize to new samples. 



10.2.1 Experimental Setup 

After transformation of the data with the kernel map x i— > <fi(x), the purpose is to compute a 
fixed-rank kernel matrix based on a limited amount of pairwise distances in the kernel feature 
space and on some information about class labels. 

Distance constraints are generated as < y%j(l — a) for identically labeled samples 
and yij > J/«(l + a) for differentially labeled samples, where a > is a scaling factor, 
Vij = ||0(xj) - <Kxj)|| 2 and fa = Tr(W(ei - e,)(ej - ej) T ) = (e* - e i ) T W(e i - e,). 
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We investigate both the influence of the amount of side-information provided, the influence 
of the approximation rank and the computational time required by the algorithms. 

To quantify the performance of the learned kernel matrix, we perform either a classification 
or a clustering of the samples based on the learned kernel. For classification, we compute the 
test set accuracy of a fc-nearest neighbor classifier (k = 5) using a two-fold cross-validation 
protocol (results are averaged over 10 random splits). For clustering, we use the K- means 
algorithm with the number of clusters equal to the number of classes in the problem. To 
overcome K-means local minima, 10 runs are performed in order to select the result that has 
lead to the smaller value of the K-means objective. The quality of the clustering is measured 
by the normalized mutual inform ation (NMI) shared between the random variables of cluster 
indicators C and target labels T Strehl et al. . 2000| . 



NMI 



2 J(C;T) 
(H(C) + H(T)Y 



where I(X\;X2) = H(X\) — H(X\\X2) is the mutual information between the random vari- 
ables X\ and X2, H{X\) is the Shannon entropy of X±, and H(X\\X2) is the conditional 
entropy of X\ given X<i- This score ranges from to 1, the larger the score, the better the 
clustering quality. 



10.2.2 Compared Methods 

We compare the following methods: 

1. Batch algorithms (jlip and ([22]) . adapted to handle inequalities (see Section [8T2 



The kernel learning algorithm LogDet-KL (Kulis et all 120091 ] which learn kernel matrices 
of fixed range space for a given set of distance constraints. 



The kernel spectral regression (KSR) algorithm of ICai et al.l (20071 ] using a similarity 
matrix N constructed as follows. Let N be the adjacency matrix of a 5-NN graph based 
on the initial kernel. We modify N according to the set of available constraints: Njj = 1 



if samples x.; and Xj belong to the same class (must-link constraint), Njj 
Xj and Xj do not belong to the same class (cannot-link constraint). 



if samples 



The Maximum Variance Unfolding (MVU) algorithm [Weinberger et all l2004j |. 



5. The Kernel PCA algorithm [Scholkopf et all Il998l |. 
The last two algorithms are unsupervised techniques that are provided as baselines. 



10.2.3 Results 

The first experiment is reproduced from Tsuda et al. 2005 1 and Kulis et al. 2009]. The goal 
is to reconstruct the GyrB kernel matrix based on distance constraints only. This matrix 
contains information about the proteins of three bacteria species. The distance constraints 
are randomly generated from the original kernel matrix with a = 0. We compare the pro- 
posed batch methods with the LogDet-KL algorithm, the only competing algorithm that also 
learns directly fro m distance constraints. This algorithm is the best performer reported by 
Kulis et al. 20091 ] for this experiment. All algorithms start from the identity matrix that do 
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Figure 6: Left: full-rank kernel learning on the Gyrb data set. The algorithm based on the 
polar geometry competes with LogDet-KL. Right: low-rank kernel learning on the Digits 
data set. The proposed algorithms outperform the compared methods as soon as a sufficiently 
large number of constraints is provided. 



not encode any domain information. Figure [6] (left) reports the fc-NN classification accuracy 
as a function of the number of distance constraints provided. In this full-rank learning set- 
ting, the algorithm based on the polar geometry compete with the LogDet-KL algorithm. The 
convergence time of the algorithm based on the polar geometry is however much faster (0.15 
seconds versus 58 seconds for LogDet-KL when learning 1000 constraints). The algorithm 
based on the flat geometry has inferior performance when too few constraints are provided. 
This is because in the kernel learning setting, updates of this algorithm only involve the rows 
and columns that correspond to the set of points for which constraints are provided. It may 
thus result in a partial update of the kernel matrix entries. This issue disappears as the 
number of provided constraints increases. 



The second experiment is reproduced from iKulis et al.l [2009fl. It aims at improving an 
existing low-rank kernel using limited information about class labels. A rank-16 kernel matrix 
is computed for clustering a database of 300 handwritten digits randomly sampled from the 
3, 8 and 9 digits of the Di gits dataset (sinc e we could not find out the specific samples 
that have been selected by Kulis et al. 2009l |. we made our own samples selection). The 
distance constraints are randomly sampled from a linear kernel on the input data K = XX T 
and a = 0.25. The results are presented in Figure [6] (right). The figure shows that KSR, 
LogDet-KL and the algorithm based on the polar geometry with A = perform similarly. 
These methods are however outperformed by the proposed algorithms (flat geometry and polar 
geometry with A = 0.5) when the number of constraints is large enough. This experiment 
also enlightens the flexibility of the polar geometry, which allows us to fix the subspace in 
situations where too few constraints are available. 

Finally, we tackle the kernel learning problem on a larger data set. We use the test set of 
the USPS datasetjll which contains 2007 samples of handwritten zip code digits. The data are 
first transformed using the kernel map «(xj,Xj) = exp(— 7||xj — Xj|||) with 7 = 0.001 and we 
further center the data in the kernel feature space. Pairwise distance constraints are randomly 



7 We use the ZIP code data from http: //www-stat-class . Stanford. edu/~tibs/ElemStatLearn/ data, html 
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Figure 7: Clustering the USPS data set. Left: clustering score versus number of constraints. 
Right: clustering score versus approximation rank. When the number of provided constraints 
is large enough, the proposed algorithms perform as good as the KSR algorithm. It outper- 
forms the LogDet-KL algorithm and baselines. 



sampled from that kernel matrix with a = 0.5. Except KSR that has its own initialization 
procedure, algorithms start from the kernel matrix provided by kernel PCA. 

Figure [7] (left) shows the clustering performance as a function of the number of constraints 
provided when the approximation rank is fixed to r = 25. Figure [7] (right) reports the 
clustering performance as a function of the approximation rank when the number of constraints 
provided is fixed to lOOiT. When the number of provided constraints is large enough, the 
proposed algorithms perform as good as KSR and outperform the LogDet-KL method that 
learn a kernel of fixed-range space. Average computational times for learning a rank-6 kernel 
from WOK constraints are 0.57 seconds for KSR, 3.25 seconds for the algorithm based on 
the flat geometry, 46.78 seconds for LogDet-KL and 47.30 seconds for the algorithm based on 
the polar geometry. In comparison, the SDP-based MVU algorithm takes 676.60 seconds to 
converge. 



10.3 Mahalanobis Distance Learning 

In this section, we tackle the problem of learning from data a Mahalanobis distance for super- 
vised classification and compare our methods to state-of-the-art Mahalanobis metric learning 
algorithms. 



10.3.1 Experimental Setup 

For the considered problem, the purpose is to learn the parameter W of a Mahalanobis distance 
d\y(xj, x i) = ( x t — x j) T W(xj — Xj), s uch that the distan ce satisfies as much as possible a given 
set of constraints. As in the paper of Davis et al. 20071 ] . we generate the constraints from the 



learning set of samples as <iw( x i> x j) < I f° r same-class pairs and dw( x i, x j) > u f° r different- 
class pairs. The scalars u and I estimate the 95-th and 5-th percentiles of the distribution 
of Mahalanobis distances parameterized by a chosen baseline Wo- The performance of the 
learned distance is then quantified by the test error rate of a /c-nearest neighbor classifier 
based on the learned distance. All experiments use the setting k = 5, breaking ties arbitrarily. 
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Unless for the Isolet data set for which a specific train/test partition is provided, error rates 
are computed using two-fold cross validation. Results are averaged over 10 random partitions. 

10.3.2 Compared Methods 

We compare the following distance learning algorithms: 

1. Batch algorithms (JTT]) and (J22j) , 

2. ITML 



Davis et all 12007 1 



3. LMNN (Weinberger and Saul l2009l | . 

4. Online algorithms ([10]) and (f2~Tj) . 



5. LEGO [Jain et all 120081 ] . 



6. POL A [Shalev-Shwartz et all 12004 1. 



When some methods require the tuning of an hyper-parameter, this is performed by a two-fold 
cross-validation procedure. The slack parameter of ITML as well as the step size of POLA are 
selected in the range of values 10 fc with k = —3, 3. The step size of LEGO is selected in this 
same range of value for the UCI datasets, and in the range of value 10 fc with k = —10, —5 
for the larger data sets Isolet and Prostate. 



10.3.3 Results 



CB Batch flat geometry 

□ Batch polar geometry 

| | ITML 

□ LEGO 

□ POLA 

□ LMNN 
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Figure 8: Full-rank distance learning on the UCI data sets. The proposed algorithms compete 
with state-of-the-art methods for learning a full-rank Mahalanobis distance. 



Reproducing a classical benchmark experiment from iKulis et al.l [2009|, we demonstrate 
that the proposed batch algorithms compete with state-of-the-art full-rank Mahalanobis dis- 
tance learning algorithms on several UCI datasets (Figure [8]). We have not included the online 
versions of our algorithms in this comparison because we consider that the batch approaches 
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are more relevant on such small datasets. Except POLA and LMNN which do not learn from 
provided pairwise constraints, all algorithms process 40c(c — 1) constraints, where c is the 
number of classes in the data. We choose the Euclidean distance (Wq = I) as the baseline 
distance for initializing the algorithms. Figure [8] reports the results. The two proposed algo- 
rithms compete favorably with the other full-rank distance learning techniques, achieving the 
minimal average error for 4 of the 5 considered data sets. 
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Figure 9: Low-rank Mahalanobis distance learning. For low values of the rank, the proposed 
algorithms perform much better than the methods that project the data on the top principal 
directions and learn a full-rank distance on the projected data. 



We finally evaluate the proposed algorithms on higher-dimensional data sets in the low- 
rank regime (Figure [9]). The distance constraints are generated as in the full-rank case, but 
the initial baseline matrix is now computed as Wo = GoGq , where Go's columns are the top 
principal directions of the data. For the Isolet data set, WOK constraints are generated, and 
10K constraints are generated for the Prostate data set. For scalability reasons, algorithms 
LEGO, LMNN and ITML must proceed in two steps: the data are first projected onto the 
top principal directions and then a full-rank distance is learned within the subspace spanned 
by these top principal directions. In contrast, our algorithms are initialized with the top 
principal direction, but they operate on the data in their original feature space. Overall, the 
proposed algorithms achieve much better performance than the methods that first reduce the 
data. This is particularly striking when the rank is very small compared to problem size. The 
performance gap reduces as the rank increases. However, for high-dimensional problems, one 
is usually interested in efficient low-rank approximations that gives satisfactory results. 



11 Conclusion 

In this paper, we propose gradient descent algorithms to learn a regression model parameter- 
ized by a fixed-rank positive semidefinite matrix. The rich Riemannian geometry of the set of 
fixed-rank PSD matrices is exploited through a geometric optimization approach. 

The resulting algorithms overcome the main difficulties encountered by the previously 
proposed methods as they scale to high-dimensional problems, and they naturally enforce the 
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rank constraint as well as the positive definite property while leaving the range space of the 
matrix free to evolve during optimization. 

We apply the proposed algorithms to the problem of learning a distance function from 
data, when the distance is parameterized by a fixed-rank positive semidefinite matrix. The 
good performance of the proposed algorithms is illustrated over several benchmarks. 
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A Convergence Proof of Algorithm ( TTU 



Bottoul 1998 ] reviews the mathematical tools required to prove almost sure convergence, that 



is asymptotic convergence with probability one, of stochastic gradient algorithms. Almost 
sure convergence follows from the following five assumptions: 

(Al) F(G) = Ex,j/{^(y, y)} > is three times differentiable with bounded derivatives, 
(A2) the step sizes satisfy YluLi Vt < 00 anc ^ 'ft = 00 ' 

(A3) E x ,,{||grad/(G)||2,} < h + k 2 \\G\\ 2 F , where f(G) = £(y,y), 

(A4) 3h! > 0, inf Tr(G T E x , J/ {grad/(G)}) > 0, 

||G[||>Ai 

(A5) lk 2 >h 1 ,V(X,y)€Xxy, sup ||grad/(G)|| F < A*, 

\\G\\%<h 2 

where \\-\\f is the Frobenius norm. Provided that algorithm (|10[) is equipped with an adaptive 
step size St = r/t/ max(||Gj|||i, 1), where r\t satisfy (A2), we have the following convergence 
result. 

Proposition A.l. For bounded data (X, y), algorithm fjlO[) equipped with the step size Sf 
defined above converges almost surely to the set of stationary points of the cost function 
E Xi2/ {(y-y) 2 /2}. 

Proof. The proof is completed in two steps. First, it is shown that the stochastic sequence 

u t = max(/i 2 , HGjIIp), 

defines a Lyapunov process (always positive and decreasing on average) which is bounded 
almost surely by h 2 - This implies that G^ is almost surely confined within distance y/ h-z from 
the o rigin and provides almost sure bounds on all continuous functions of G^. In iBottou 



1998J] , confinement is essentially based on (A3) and (A4). In the current proof, we rely on 
the fact that E x ,y{||grad/(G)/max(||G|||, 1)|||} < h + fc 2 ||G|||. 

Second, the Lyapunov process Vt = F(Gt) > is proved to converge almost surely. Con- 
vergence of F(Gt) is then used to show that w ± = gra d F(G t) tends to zero almost surely. 



Technical details are adapted from the paper of IBottou! [19981 ] . □ 
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In practice, saddle points and local maxima are unstable solutions while convergence to 
asymptotic plateaus is excluded by (A4). As a result, almost sure convergence to a local 
minimum of the expected cost is obtained. 
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